perm filename TTT[4,KMC] blob sn#026715 filedate 1973-02-28 generic text, type T, neo UTF8
00100		HOW TO USE AND HOW NOT TO USE TURING-LIKE TESTS
00200	        IN EVALUATING THE ADEQUACY OF SIMULATION MODELS
00300	
00400	               KENNETH MARK COLBY
00500	                    AND 
00600	              FRANKLIN DENNIS HILF
00700	
00800		It is very easy to become confused about  Turing's  Test.  In
00900	part  this  is  due  to  Turing himself who introduced the now-famous
01000	imitation game in a  1950  paper  entitled  COMPUTING  MACHINERY  AND
01100	INTELLIGENCE [3].  A careful reading of this paper reveals there are
01200	actually two games proposed, the second of which is commonly called
01300	Turing's test.
01400		In the first imitation game  two  groups  of  judges  try  to
01500	determine which of two interviewees is a woman. Communication between
01600	judge and  interviewee  is  by  teletype.  Each  judge  is  initially
01700	informed  that  one  of the interviewees is a woman and one a man who
01800	will pretend to be a woman. After the interview, the judge  is  asked
01900	what we shall call the woman-question, i.e., which interviewee was the
02000	woman?  Turing does not say what else  the  judge  is  told  but  one
02100	assumes  the  judge is NOT told that a computer is involved nor is he
02200	asked to determine which  interviewee  is  human  and  which  is  the
02300	computer.  Thus,  the  first  group  of  judges  would  interview two
02400	interviewees:    a woman, and a man pretending to be a woman.
02500		The  second  group  of judges would be given the same initial
02600	instructions, but unbeknownst to them, the two interviewees would  be
02700	a  woman  and a computer programmed to imitate a woman.   Both groups
02800	of judges  play  this  game  until  sufficient  statistical  data are
02900	collected  to  show  how  often the right identification is made. The
03000	crucial question then is:  do the judges decide wrongly AS OFTEN when
03100	the  game  is  played  with man and woman as when it is played with a
03200	computer substituted for the man?  If so, then the program is
03300	considered  to  have  succeeded in imitating a woman as well as a man
03400	imitating a woman.  For emphasis we repeat: in asking the
03500	woman-question  in  this  game,  judges  are not required to identify
03600	which interviewee is human and which is machine.
03700		Later  on  in  his  paper  Turing proposes a variation of the
03800	first game. In the second game one interviewee is a man and one is  a
03900	computer.   The judge is asked to determine which is man and which is
04000	machine, which we shall call the machine-question. It is this version
04100	of  the game which is commonly thought of as Turing's test.    It has
04200	often been suggested as a means of validating computer simulations of
04300	psychological processes.
04400		In the course of testing a  simulation  (PARRY)  of  paranoid
04500	linguistic behavior in a psychiatric interview, we conducted a number
04600	of   Turing-like   indistinguishability   tests   [1].      We    say
04700	`Turing-like' because none of them consisted of playing the two games
04800	described above. We chose not to play these games  for  a  number  of
04900	reasons  which  can  be  summarized  by  saying that they do not meet
05000	modern criteria for good experimental design.  In designing our tests
05100	we  were  primarily  interested in learning more about developing the
05200	model.  We did not believe the simple machine-question to be useful
05300	for progressively increasing the credibility of the model, but we
05400	investigated a variation of it to
05500	satisfy the curiosity of colleagues in artificial intelligence.
05600		In this design eight psychiatrists interviewed by teletype
05700	two `patients', one being PARRY and one being an actual
05800	hospitalized paranoid patient.   The interviewers were  not  informed
05900	that  a simulation was involved nor were they asked to identify which
06000	was the machine. Their task was to conduct a  diagnostic  psychiatric
06100	interview  and  rate  each  response  from the `patients' along a 0-9
06200	scale of paranoidness, 0 meaning none and 9 the highest degree.
06300	Transcripts   of   these  interviews,  without  the  ratings  of  the
06400	interviewers, were then utilized for  various  experiments  in  which
06500	randomly   selected   expert  judges  conducted  evaluations  of  the
06600	interview transcripts.   For example, in one experiment it was  found
06700	that patients and model were indistinguishable along the dimension of
06800	paranoidness.
06900		To  ask  the machine-question, we sent interview transcripts,
07000	one with a patient and one  with  PARRY,  to  100  psychiatrists
07100	randomly  selected from the Directory of American Specialists and the
07200	Directory of the American Psychiatric Association. Of the 41 replies,
07300	21 (51%) made the correct identification while 20 (49%) were wrong.
07400	Based on this random sample of 41 psychiatrists we are 95%  confident
07500	that  between  35.9%  and  66.5%  of all psychiatrists could make the
07600	correct identification, a figure which is close to  a  chance  level.
07700	(Our  statistical  consultant  was  Dr.  Helena C.  Kraemer, Research
07800	Associate  in  Biostatistics,  Department  of  Psychiatry,   Stanford
07900	University.)
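The intervals reported here behave like the standard normal-approximation (Wald) confidence interval for a binomial proportion. The paper credits its statistical consultant but does not name the method, so the following is only a sketch under that assumption; it does reproduce the first quoted interval from the 21-of-41 result:

```python
import math

def wald_ci(successes, n, z=1.96):
    """Normal-approximation (Wald) 95% confidence interval
    for a binomial proportion successes/n."""
    p = successes / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

# 21 of the 41 replying psychiatrists identified the machine correctly.
lo, hi = wald_ci(21, 41)
print(f"{lo:.1%} to {hi:.1%}")  # prints 35.9% to 66.5%
```

The same formula applied to the 32-of-67 computer-scientist sample gives roughly 36% to 60%, matching the range quoted below.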
08000		Psychiatrists are expert judges of patient interview behavior
08100	but  they  are unfamiliar with computers. Hence we conducted the same
08200	test  with  100  computer  scientists  randomly  selected  from   the
08300	membership list of the Association for Computing Machinery (ACM).  Of
08400	the 67 replies, 32 (48%) were right and 35 (52%) were wrong. Based on
08500	this  random  sample  of  67 computer scientists we are 95% confident
08600	that between 36% and 60% of all computer scientists  could  make  the
08700	correct identification, a range close to that expected by chance.
08800		Thus the answer to this machine-question "can expert  judges,
08900	psychiatrists and computer scientists, using teletyped transcripts
09000	of psychiatric interviews, distinguish between paranoid patients  and
09100	a simulation of paranoid processes?" is "No". But what do we learn
09200	from this?  It is some comfort that the answer was not "yes" and the
09300	null  hypothesis  (no  differences) failed to be rejected, especially
09400	since statistical tests are somewhat biased in favor of rejecting the
09500	null hypothesis [2].  Yet this answer does not tell us what we would
09600	most like to know, i.e. how to improve the model.  Simulation  models
09700	do  not spring forth in a complete, perfect and final form; they must
09800	be gradually developed over time. Perhaps we might obtain a "yes"
09900	answer to the machine-question if we allowed a large number of expert
10000	judges to conduct the interviews themselves rather than studying
10100	transcripts of interviews conducted by others.  Such an answer would indicate that the
10200	model must be improved but unless we systematically investigated  how
10300	the  judges  succeeded in making the discrimination we would not know
10400	what aspects of the model to work on. The logistics of such a  design
10500	are  immense  and obtaining a large N of judges for sound statistical
10600	inference  would  require   an   effort   disproportionate   to   the
10700	information-yield.
10800		A more efficient and informative way to use Turing-like tests
10900	is to ask judges to make ordinal ratings along scaled dimensions from
11000	teletyped  interviews.     We  shall  term  this  approach asking the
11100	dimension-question.   One can then compare scaled ratings received by
11200	the patients and by the model to precisely determine where and by how
11300	much they differ.        Model builders  strive  for  a  model  which
11400	shows     indistinguishability     along    some    dimensions    and
11500	distinguishability along others. That is, the model converges on what
11600	it is supposed to simulate and diverges from that which it is not.
11700		We  mailed  paired-interview  transcripts  to   another   400
11800	randomly  selected psychiatrists asking them to rate the responses of
11900	the two `patients' along certain dimensions. The judges were  divided
12000	into  groups,  each  judge  being asked to rate responses of each I-O
12100	pair in the interviews along four dimensions.  The total number of
12200	dimensions in this test was twelve: linguistic noncomprehension,
12300	thought disorder, organic brain syndrome, bizarreness,  anger,  fear,
12400	ideas  of  reference, delusions, mistrust, depression, suspiciousness
12500	and mania. These are dimensions which psychiatrists commonly  use  in
12600	evaluating patients.
12700		Table 1 shows there were significant differences, with  PARRY
12800	receiving   higher   scores   along   the  dimensions  of  linguistic
12900	noncomprehension, thought disorder, bizarreness, anger, mistrust and
13000	suspiciousness. On the dimension of delusions the patients were rated
13100	significantly higher. There were no significant differences along the
13200	dimensions of organic brain syndrome, fear, ideas of reference,
13300	depression and mania.
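The paper does not say which statistical test produced the significance judgments in Table 1. One common choice for comparing two samples of ordinal 0-9 ratings is a rank-sum (Mann-Whitney) test; the sketch below is such a test under that assumption, run on ratings invented purely for illustration:

```python
import math

def rank_sum_z(xs, ys):
    """Mann-Whitney rank-sum comparison of two samples of ordinal
    ratings.  Returns (U1, z) using the normal approximation, with
    average ranks for ties (no tie correction in the variance)."""
    vals = list(xs) + list(ys)
    order = sorted(range(len(vals)), key=lambda k: vals[k])
    ranks = [0.0] * len(vals)
    i = 0
    while i < len(order):
        j = i
        while j < len(order) and vals[order[j]] == vals[order[i]]:
            j += 1
        avg = (i + j + 1) / 2          # 1-based average rank of a tie group
        for k in range(i, j):
            ranks[order[k]] = avg
        i = j
    n1, n2 = len(xs), len(ys)
    r1 = sum(ranks[:n1])               # rank sum of the first sample
    u1 = r1 - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return u1, (u1 - mu) / sigma

# Hypothetical bizarreness ratings, invented for illustration only:
parry_ratings   = [7, 8, 6, 9, 7, 8]
patient_ratings = [4, 5, 3, 6, 4, 5]
u1, z = rank_sum_z(parry_ratings, patient_ratings)
# z above 1.96 would indicate PARRY rated significantly higher
```

A per-dimension comparison of this kind shows not only that the two groups of ratings differ but in which direction, which is exactly the information the model builder needs.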
13400		While    tests    asking    the   machine-question   indicate
13500	indistinguishability at  the  gross  level,  a  study  of  the  finer
13600	structure of the model's behavior through ratings along scaled
13700	dimensions  shows  statistically  significant   differences   between
13800	patients  and  model.     These  differences are of help to the model
13900	builder in suggesting which aspects of the model must be modified and
14000	improved  in  order  to  be  considered an adequate simulation of the
14100	class of paranoid patients it is intended to simulate.  For  example,
14200	it  is  clear  that  PARRY'S language-comprehension must be improved.
14300	Once this has been implemented, a future test will  tell  us  whether
14400	improvement has occurred and by how much in comparison to the earlier
14500	version.   Successive identification of particular areas  of  failure
14600	in the model permits their improvement and the development of
14700	more adequate model-versions.
14800		Further evidence that the machine-question is too coarse  and
14900	insensitive  a test comes from the following experiment. In this test
15000	we constructed a random version of the paranoid model which  utilized
15100	PARRY'S  output statements but expressed them randomly no matter what
15200	the interviewer said.   Two psychiatrists conducted  interviews  with
15300	this  model, transcripts of which were paired with patient interviews
15400	and sent to 200  randomly  selected  psychiatrists  asking  both  the
15500	machine-question  and  the dimension-question.  Of the 69 replies, 34
15600	(49%) were right and 35 (51%) wrong. Based on this random  sample  of
15700	69 psychiatrists we are 95% confident that between 39% and 63% of all
15800	psychiatrists would make the correct identification, again indicating
15900	a chance level.   However as shown in Table 2 significant differences
16000	appear along the dimensions of linguistic  noncomprehension,  thought
16100	disorder  and  bizarreness,  with RANDOM-PARRY rated higher. On these
16200	particular dimensions we can  construct  a  continuum  in  which  the
16300	random  version  represents one extreme, the actual patients another.
16400	Our (nonrandom) PARRY lies  somewhere  between  these  two  extremes,
16500	indicating  that  it  performs  significantly  better than the random
16600	version but still requires improvement before being indistinguishable
16700	from patients (see Fig. 1). Hence this approach provides yardsticks for
16800	measuring the adequacy of this or any other dialogue simulation model
16900	along the relevant dimensions.
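The continuum of Fig. 1 can be made numerical: with mean ratings on a given dimension for the actual patients and for the random version as the two endpoints, a model's mean rating fixes its position between them. The numbers below are hypothetical, invented only to illustrate the construction:

```python
def continuum_position(patient_mean, random_mean, model_mean):
    """Place a model's mean rating on a 0-1 continuum where
    0 = actual patients and 1 = the random version (RANDOM-PARRY)."""
    return (model_mean - patient_mean) / (random_mean - patient_mean)

# Hypothetical mean thought-disorder ratings, for illustration only:
pos = continuum_position(patient_mean=2.0, random_mean=6.0, model_mean=3.5)
# pos = 0.375: the model lies between the patients (0) and the
# random version (1), closer to the patients
```

Tracking this position across model versions gives the yardstick described above: progress on a dimension means the position moving toward 0.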
17000		We  conclude  that  when model builders want to conduct tests
17100	which indicate in which direction  progress  lies  and  to  obtain  a
17200	measure  of  whether  progress  is  being  achieved,  the  way to use
17300	Turing-like tests is to ask  expert  judges  to  make  ratings  along
17400	multiple  dimensions considered essential to the model.  Useful tests
17500	do not prove a model; they probe it for its sensitivities.  Simply
17600	asking   the  machine-question  yields  no  information  relevant  to
17700	improving what the model builder knows is only a first approximation.
17800	
17900	
18000			REFERENCES
18100	
18200	[1] Colby, K.M., Hilf, F.D., Weber, S., and Kraemer, H.C.
18300	Turing-like indistinguishability tests for the validation of a
18400	computer simulation of paranoid processes. ARTIFICIAL
18500	INTELLIGENCE, 3 (1972), 199-221.
18600	
18700	[2] Meehl, P.E. Theory testing in psychology and physics: a
18800	methodological paradox. PHILOSOPHY OF SCIENCE, 34 (1967), 103-115.
18900	
19000	[3] Turing, A. Computing machinery and intelligence. Reprinted in:
19100	COMPUTERS AND THOUGHT (Feigenbaum, E.A. and Feldman, J., eds.).
19200	McGraw-Hill, New York, 1963, pp. 11-35.